INTERSPEECH.2023 - Language and Multimodal

Total: 187

#1 Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer

Authors: Paul-Ambroise Duquenne ; Holger Schwenk ; Benoît Sagot

Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.

#2 Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters

Authors: Proyag Pal ; Brian Thompson ; Yogesh Virkar ; Prashant Mathur ; Alexandra Chronopoulou ; Marcello Federico

To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. the translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target-language phoneme sequences. We also introduce auxiliary counters to help the decoder keep track of timing information while generating target phonemes. We show that our model improves translation quality and isochrony compared to previous work, where the translation model is instead trained to predict interleaved sequences of phonemes and durations.
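
A minimal PyTorch sketch of the two ideas, target factors and an auxiliary duration counter (not the authors' code; the paper uses a transformer, while this toy uses a single recurrent cell, and all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class FactoredDecoderStep(nn.Module):
    """Toy decoder step: one shared hidden state feeds two output heads
    (phoneme and duration 'target factors'); a scalar counter holding the
    remaining time budget is appended to the input embedding."""
    def __init__(self, n_phonemes: int, n_dur_bins: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.rnn = nn.GRUCell(d_model + 1, d_model)   # +1 for the counter
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.duration_head = nn.Linear(d_model, n_dur_bins)

    def forward(self, prev_phoneme, remaining_time, hidden):
        x = torch.cat([self.embed(prev_phoneme), remaining_time.unsqueeze(-1)], dim=-1)
        hidden = self.rnn(x, hidden)
        return self.phoneme_head(hidden), self.duration_head(hidden), hidden

# usage: batch of 2, 50 phoneme types, 32 duration bins
step = FactoredDecoderStep(n_phonemes=50, n_dur_bins=32)
h = torch.zeros(2, 256)
phon_logits, dur_logits, h = step(torch.tensor([3, 7]), torch.tensor([1.2, 0.8]), h)
```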

#3 StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

Authors: Kun Song ; Yi Ren ; Yi Lei ; Chunfeng Wang ; Kun Wei ; Lei Xie ; Xiang Yin ; Zejun Ma

Direct speech-to-speech translation (S2ST) has gradually become popular, as it has many advantages over cascaded S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores speech style transfer from the source language to the target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in the more practical zero-shot scenario. To solve this problem, we first build a parallel corpus using a multi-lingual multi-speaker text-to-speech synthesis (TTS) system and then propose StyleS2ST, a model with cross-lingual speech style transfer ability based on a style adaptor built on a direct S2ST framework. By enabling continuous style-space modeling of the acoustic model through parallel-corpus training and non-parallel TTS data augmentation, StyleS2ST captures the cross-lingual acoustic feature mapping from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.

#4 Joint Speech Translation and Named Entity Recognition

Authors: Marco Gaido ; Sara Papi ; Matteo Negri ; Marco Turchi

Modern automatic translation systems aim to support users by providing contextual knowledge. In this framework, a critical task is enriching the output with information about the mentioned entities. This is currently achieved by processing the generated translations with named entity recognition (NER) tools and retrieving entity descriptions from knowledge bases. In light of the recent promising results shown by direct speech translation (ST) models and the known weaknesses of cascades (error propagation and additional latency), in this paper we propose multitask models that jointly perform ST and NER, and compare them with a cascade baseline. Experimental results on three language pairs (en-es/fr/it) show that our models significantly outperform the cascade on the NER task (by 0.4-1.0 F1), without degradation in translation quality, and with the same computational efficiency as a plain direct ST model.

#5 Analysis of Acoustic information in End-to-End Spoken Language Translation

Authors: Gerard Sant ; Carlos Escolano

End-to-end Transformer-based models are the most popular approach to Spoken Language Translation (SLT). While these models obtain state-of-the-art results, we are still far from understanding how they extract acoustic information from the data and how this information is transformed into semantic representations. In this paper, we seek to provide a better understanding of the flow of acoustic information through speech-to-text translation models. By means of Speaker Classification and Spectrogram Reconstruction tasks, this study (i) interprets the main role of the encoder with respect to the acoustic features, (ii) highlights the importance of acoustic information throughout the model and its transfer between encoder and decoder, and (iii) reveals the significant effect of the downsampling convolutional layers on learning acoustic features. (iv) Finally, we also observe a strong correlation between the semantic domain and the speakers' labels in MuST-C.
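
A hedged sketch of the kind of probing used for such analyses: a linear speaker-classification probe trained on frozen encoder states (illustrative only; dimensions, mean pooling, and the probing setup are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SpeakerProbe(nn.Module):
    """Linear probe: mean-pool frozen encoder states over time and predict
    the speaker label; probe accuracy indicates how much speaker (acoustic)
    information a given layer retains."""
    def __init__(self, d_model: int, n_speakers: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_speakers)

    def forward(self, encoder_states):          # (batch, time, d_model)
        pooled = encoder_states.mean(dim=1)     # utterance-level vector
        return self.classifier(pooled)

probe = SpeakerProbe(d_model=512, n_speakers=100)
states = torch.randn(4, 120, 512)               # stand-in for frozen SLT encoder outputs
logits = probe(states)                          # (4, 100)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (4,)))
```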

#6 LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers

Authors: Peidong Wang ; Eric Sun ; Jian Xue ; Yu Wu ; Long Zhou ; Yashesh Gaur ; Shujie Liu ; Jinyu Li

Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure, so it is possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and language-agnostic, i.e., they should not require source language identification. In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers. Based on the transducer model structure, we propose four methods: a unified joint and prediction network for multilingual output, a clustered multilingual encoder, target language identification for the encoder, and connectionist temporal classification regularization. Experimental results show that LAMASSU not only drastically reduces the model size but also matches the performance of monolingual ASR and bilingual ST models.

#7 Lightweight and Efficient Spoken Language Identification of Long-form Audio

Authors: Winstead Zhu ; Md Iftekhar Tanveer ; Yang Janet Liu ; Seye Ojumu ; Rosie Jones

State-of-the-art Spoken Language Identification (SLI) systems usually focus on short audio clips, and their performance therefore degrades drastically when applied to long-form audio such as podcasts, which pose particular challenges to existing SLI approaches due to their long duration and diverse content, frequently involving multiple speakers as well as various languages, topics, and speech styles. In this paper, we propose the first system to tackle SLI for long-form audio using podcast data, training a lightweight multi-class feedforward neural classifier with speaker embeddings as input. We demonstrate that our approach can perform inference on long audio input efficiently; furthermore, our system can handle long audio files with multiple speakers and can be further extended to utterance-level inference and code-switching detection, which are currently not covered by any existing SLI system.
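
A minimal sketch of the described setup, assuming one speaker embedding per diarized segment and an episode-level decision by score averaging (names, sizes, and the aggregation rule are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class PodcastLanguageClassifier(nn.Module):
    """Lightweight multi-class feedforward classifier over per-segment
    speaker embeddings; a long-form (episode-level) prediction is obtained
    by averaging segment scores."""
    def __init__(self, emb_dim: int = 256, n_languages: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.ReLU(),
            nn.Linear(512, n_languages),
        )

    def forward(self, speaker_embeddings):       # (n_segments, emb_dim)
        return self.net(speaker_embeddings)

clf = PodcastLanguageClassifier()
segment_embs = torch.randn(30, 256)              # e.g., one embedding per diarized segment
episode_scores = clf(segment_embs).softmax(dim=-1).mean(dim=0)  # episode-level distribution
predicted_language = episode_scores.argmax().item()
```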

#8 End to End Spoken Language Diarization with Wav2vec Embeddings

Authors: Jagabandhu Mishra ; Jayadev N Patil ; Amartya Chowdhury ; Mahadeva Prasanna

The performance of available end-to-end (E2E) spoken language diarization (LD) systems is biased toward the primary language because sufficient secondary-language data is unavailable: in code-switched (CS) utterances, the primary language accounts for a much larger share of the duration than the secondary language. To address this issue, this work first uses a wav2vec (W2V) pre-trained embedding in place of the x-vector, which reduces the primary-language bias and provides a relative improvement of 30.7% in Jaccard error rate (JER) over the baseline x-vector-based E2E (X-E2E) framework. LD performance is further improved by fine-tuning the W2V embedding extractor and changing the temporal aggregation strategy from statistical pooling to attention pooling. The final JER of 22.5 represents relative improvements of 38.8% and 62.6% over the standalone fine-tuned W2V system and the baseline X-E2E framework, respectively.
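
A minimal sketch of the attention-pooling aggregation mentioned above, applied to frame-level embeddings such as wav2vec features (dimensions and the single-head scoring function are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Replace statistical (mean/std) pooling with learned attention weights
    over frame-level embeddings (e.g., wav2vec features)."""
    def __init__(self, d_in: int):
        super().__init__()
        self.score = nn.Linear(d_in, 1)

    def forward(self, frames):                        # (batch, time, d_in)
        weights = self.score(frames).softmax(dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)          # (batch, d_in)

pool = AttentionPooling(d_in=768)
w2v_frames = torch.randn(2, 300, 768)                 # stand-in for wav2vec embeddings
utterance_vec = pool(w2v_frames)                      # (2, 768)
```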

#9 Efficient Spoken Language Recognition via Multilabel Classification

Authors: Oriol Nieto ; Zeyu Jin ; Franck Dernoncourt ; Justin Salamon

Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal. Existing SLR models are either too computationally expensive or too large to run effectively on devices with limited resources. For real-world deployment, a model should also gracefully handle unseen languages outside of the target language set, yet prior work has focused on closed-set classification where all input languages are known a priori. In this paper we address these two limitations: we explore efficient model architectures for SLR based on convolutional networks, and propose a multilabel training strategy to handle non-target languages at inference time. Using the VoxLingua107 dataset, we show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods, and that our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
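
A hedged sketch of the multilabel idea: train with one independent sigmoid per target language and reject inputs whose best score stays below a threshold (the threshold value, batch, and label construction are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

n_target_languages = 107
logits = torch.randn(8, n_target_languages)      # outputs of an SLR conv-net (stand-in)
targets = torch.zeros(8, n_target_languages)
targets[torch.arange(8), torch.randint(0, n_target_languages, (8,))] = 1.0

# Multilabel training: one independent sigmoid per language instead of a softmax.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Inference: if no language clears the threshold, reject as non-target/unseen (-1).
probs = torch.sigmoid(logits)
threshold = 0.5
best_prob, best_lang = probs.max(dim=-1)
decisions = torch.where(best_prob > threshold, best_lang, torch.full_like(best_lang, -1))
```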

#10 Description and Analysis of ABC Submission to NIST LRE 2022

Authors: Pavel Matejka ; Anna Silnova ; Josef Slavíček ; Ladislav Mosner ; Oldřich Plchot ; Michal Klčo ; Junyi Peng ; Themos Stafylakis ; Lukáš Burget

This paper summarizes our efforts in the NIST Language Recognition Evaluation 2022, resulting in systems with competitive performance. We provide both a description and an analysis of the systems. We describe the data used to train our models, followed by the embedding extractors and backend classifiers. After covering the architecture, we concentrate on post-evaluation analysis. We compare different DNN topologies, different backend classifiers, and the impact of the data used to train them. We also report results with XLS-R pre-trained models. We present the performance of the systems in the Fixed condition, where participants are required to use only predefined data sets, and in the Open condition, which allows any data to be used to train the systems.

#11 Exploring the Impact of Pretrained Models and Web-Scraped Data for the 2022 NIST Language Recognition Evaluation

Authors: Tanel Alumäe ; Kunnar Kukk ; Viet-Bac Le ; Claude Barras ; Abdel Messaoudi ; Waad Ben Kheder

This paper describes the Vocapia-TalTech team systems developed for the 2022 NIST Language Recognition Evaluation (LRE22), which focused on spoken language identification of African languages, including low-resource languages. In both fixed and open conditions, our primary systems were fused from multiple individual systems using logistic regression. In the fixed condition, we largely relied on wav2vec2.0 conformer models pretrained on the provided training data. In the open condition, we used external pretrained wav2vec2.0 models, phonotactic models, and features derived from a multilingual speech recognition system, and also augmented the provided target-language development data with additional data scraped from the web. On the LRE22 evaluation data, our final fixed and open condition systems obtained excellent results, with primary metric Cact values of 0.111 and 0.067, respectively. A post-evaluation study shows that both pretrained models and additional data are important for accurate models.

#12 Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22

Authors: Jesús Villalba ; Jonas Borgstrom ; Maliha Jahan ; Saurabh Kataria ; Leibny Paola Garcia ; Pedro Torres-Carrasquillo ; Najim Dehak

We present the efforts of JHU-CLSP/HLTCOE and MIT Lincoln Laboratory for the NIST Language Recognition Evaluation (LRE) 2022. LRE22 consisted of a language detection task, i.e., determining whether a given target language was spoken in a speech segment. LRE22 focused on telephone and broadcast narrowband speech in African languages. Since LRE17, there has been great progress in neural embeddings, combined or not with self-supervised models like Wav2Vec2. Therefore, one of our goals was to investigate these new models, i.e., ECAPA-TDNN, Res2Net, or Wav2Vec2+ECAPA-TDNN, in the LRE scenario. In the fixed training condition, LRE22 target languages were only included in a small development set, so we focused on tuning our models to exploit the limited data. For the open condition, we built a massive training set including African data, which improved Cprimary by 50% w.r.t. the fixed condition. Wav2Vec2 embeddings were the best, outperforming ECAPA and Res2Net by 11% and 3%, respectively.

#13 Detection of Emotional Hotspots in Meetings Using a Cross-Corpus Approach

Authors: Georg Stemmer ; Paulo Lopez Meyer ; Juan Del Hoyo Ontiveros ; Jose Lopez ; Hector A. Cordourier ; Tobias Bocklet

Speech emotion recognition for natural human-to-human conversations has many useful applications, including generating comprehensive meeting transcripts or detecting communication problems. We investigate the detection of emotional hotspots, i.e., regions of increased speaker involvement, in technical meetings. As there is a scarcity of annotated, non-acted corpora, and to avoid introducing unwanted biases into our models, we follow a cross-corpus approach in which models are trained on data from domains unrelated to the test data. In this work we propose a model ensemble trained on spontaneous phone conversations, political discussions, and acted emotions. Evaluation is performed on the natural ICSI and AMI meeting corpora, where we used existing hotspot annotations for ICSI and created labels for the AMI corpus. A semi-supervised fine-tuning procedure is introduced to adapt the model. We show that an equal error rate below 21% can be achieved using the proposed cross-corpus approach.

#14 Detection of Laughter and Screaming Using the Attention and CTC Models

Authors: Takuto Matsuda ; Yoshiko Arimoto

This study aimed to detect social signals, such as laughter and screams, in real environments. Social signals influence human-to-human communication, so to apply these signals effectively in various systems, computer systems must detect them appropriately. In this study, social signal detection (SSD) experiments were conducted to determine which of three feature sets (a spectral feature set, a prosodic feature set, or a combined spectral and prosodic feature set) was best for detecting laughter and screaming. The results showed that using the combined spectral and prosodic feature set yielded the best performance, with 81.83% accuracy for laughter and 81.68% accuracy for screams. Moreover, the detection model comparison revealed that the bidirectional long short-term memory (BiLSTM) connectionist temporal classification (CTC) model yielded the best laughter detection performance, while attention-CTC was best for scream detection. These results suggest that CTC is effective for SSD.
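
A minimal sketch of a BiLSTM-CTC detector over a small label inventory (blank/laughter/scream); the feature dimensionality, hidden size, and label set are illustrative assumptions, not the study's exact configuration:

```python
import torch
import torch.nn as nn

class BiLSTMCTCDetector(nn.Module):
    """BiLSTM encoder with a CTC output layer over a small label set
    (blank / laughter / scream) for social signal detection."""
    def __init__(self, n_features: int, n_labels: int = 3, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_labels)    # index 0 reserved for the CTC blank

    def forward(self, feats):                         # (batch, time, n_features)
        h, _ = self.lstm(feats)
        return self.out(h).log_softmax(dim=-1)

model = BiLSTMCTCDetector(n_features=40)              # e.g., spectral + prosodic features
feats = torch.randn(2, 200, 40)
log_probs = model(feats).transpose(0, 1)              # CTC expects (time, batch, labels)
targets = torch.tensor([1, 1, 2])                     # concatenated label sequences
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.tensor([200, 200]),
                           target_lengths=torch.tensor([2, 1]))
```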

#15 Capturing Formality in Speech Across Domains and Languages

Authors: Debasmita Bhattacharya ; Jie Chi ; Julia Hirschberg ; Peter Bell

The linguistic notion of formality is one dimension of stylistic variation in human communication. A universal characteristic of language production, formality has surface-level realizations in written and spoken language. In this work, we explore ways of measuring the formality of such realizations in multilingual speech corpora across a wide range of domains. We compare measures of formality, contrasting textual and acoustic-prosodic metrics. We believe that a combination of these should correlate well with downstream applications. Our findings include: an indication that certain prosodic variables might play a stronger role than others; no correlation between prosodic and textual measures; limited evidence for anticipated inter-domain trends, but some evidence of consistency of measures between languages. We conclude that non-lexical indicators of formality in speech may be more subtle than our initial expectations, motivating further work on reliably encoding spoken formality.

#16 Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio

Authors: Jialu Li ; Mark Hasegawa-Johnson ; Nancy L. McElwain

To perform automatic family audio analysis, past studies have collected recordings using phones, video, or audio-only recording devices like LENA, investigated supervised learning methods, and used or fine-tuned general-purpose embeddings learned from large pretrained models. In this study, we advance the audio component of a new infant wearable multi-modal device called LittleBeats (LB) by learning family audio representations via wav2vec 2.0 (W2V2) pretraining. We show that, given a limited number of labeled LB home recordings, W2V2 pretrained on 1k hours of unlabeled home recordings outperforms an oracle W2V2 pretrained on 52k hours of unlabeled audio in terms of parent/infant speaker diarization (SD) and vocalization classification (VC) at home. Additional relevant external unlabeled and labeled data further benefit W2V2 pretraining and fine-tuning. With SpecAug and environmental speech corruptions, we obtain a 12% relative gain on SD and a moderate boost on VC. Code and model weights are available.
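
A hedged sketch of the fine-tuning stage only, using the Hugging Face Wav2Vec2Model as a generic W2V2 backbone with a mean-pooled linear head for vocalization classification (the checkpoint name, class count, and pooling are illustrative; the continued pretraining on unlabeled home audio is not shown):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class VocalizationClassifier(nn.Module):
    """W2V2 encoder + linear head: mean-pool frame representations over
    time and classify the vocalization type of a short audio clip."""
    def __init__(self, n_classes: int = 4, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, waveform):                      # (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state
        return self.head(frames.mean(dim=1))          # mean-pool frames, then classify

model = VocalizationClassifier()
audio = torch.randn(2, 16000)                         # two 1-second dummy clips
logits = model(audio)                                 # (2, 4)
```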

#17 Cues to next-speaker projection in conversational Swedish: Evidence from reaction times

Authors: Kathrin Feindt ; Martina Rossi ; Ghazaleh Esfandiari-Baiat ; Axel G. Ekström ; Margaret Zellers

We present the first results of a study investigating the salience and typicality of prosodic markers in Swedish at turn ends for turn-yielding and turn-keeping purposes. We performed an experiment in which participants (N=32) were presented with conversational chunks and, after the audio ended, were asked to determine which of two speakers would speak next by clicking a picture on a screen. Audio stimuli were manipulated by (i) raising and (ii) lowering f0 over the last 500 ms of a turn, (iii) speeding up or (iv) slowing down duration over the last 500 ms, and (v) raising and (vi) lowering the last pitch peak. Of all manipulations, increasing the speech rate was found to be the most disruptive (p<.005). Higher speech rate led to longer reaction times in turn-keeping and shorter reaction times in turn-yielding. Other manipulations did not significantly alter reaction times. Results may be complemented with eye movement data to elucidate the cognitive mechanisms underlying turn-taking behavior.

#18 Multiple Instance Learning for Inference of Child Attachment From Paralinguistic Aspects of Speech

Authors: Areej Buker ; Huda Alsofyani ; Alessandro Vinciarelli

Attachment is a psychological construct that accounts for the way children perceive their relationship with their caregivers. Depending on the attachment condition, a child can be either secure or insecure. Identifying as many insecure children as possible is important to mitigate the negative consequences of insecure attachment in adult life. For this reason, this article proposes an attachment recognition approach that, compared to other approaches, increases the Recall, i.e., the percentage of insecure children identified as such. The approach is based on Multiple Instance Learning, a body of methodologies dealing with data represented as "bags" of feature vectors. This is suitable for speech recordings because these are typically represented as vector sequences. The experiments involved 104 participants aged 5 to 9. The results show that insecure children can be identified with a Recall of up to 63.3% (accuracy up to 75%), an improvement with respect to most existing models.
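
A minimal sketch of the bag-level formulation: each recording is a bag of segment feature vectors, and instance scores are max-pooled into one prediction (feature dimension, network size, and max pooling are illustrative assumptions; MIL admits other pooling choices such as attention):

```python
import torch
import torch.nn as nn

class MILAttachmentClassifier(nn.Module):
    """Multiple Instance Learning: a recording is a 'bag' of feature vectors
    (one per utterance/segment); instance scores are pooled (here with max
    pooling) into a single bag-level prediction."""
    def __init__(self, feat_dim: int = 88):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, bag):                            # (n_instances, feat_dim)
        scores = self.instance_scorer(bag)             # (n_instances, 1)
        return scores.max(dim=0).values                # bag label driven by the strongest instance

model = MILAttachmentClassifier()
bag = torch.randn(37, 88)                              # variable-length bag of segment features
logit = model(bag)                                     # example convention: 1 = insecure, 0 = secure
prob_insecure = torch.sigmoid(logit)
```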

#19 A Multimodal Investigation of Speech, Text, Cognitive and Facial Video Features for Characterizing Depression With and Without Medication

Authors: Michael Neumann ; Hardik Kothare ; Doug Habberstad ; Vikram Ramanarayanan

Clinical depression is one of the most common mental disorders and technology for remote assessment of depression, including monitoring of treatment responses, is gaining more and more importance. Using a cloud-based multimodal dialog platform, we conducted a crowdsourced study to investigate the effect of depression severity and antidepressant use on various acoustic, linguistic, cognitive, and orofacial features. Our findings show that multiple features from all tested modalities show statistically significant differences between subjects with no or minimal depression and subjects with more severe depression symptoms. Moreover, certain acoustic and visual features show significant differences between subjects with moderately severe or severe symptoms who take antidepressants and those who do not take any. Machine learning experiments show that subjects with and without medication can be better discriminated from each other at higher severity levels.

#20 Understanding Disrupted Sentences Using Underspecified Abstract Meaning Representation

Authors: Angus Addlesee ; Marco Damonte

Voice assistant accessibility is generally overlooked as today's spoken dialogue systems are trained on huge corpora to help them understand the 'average' user. This raises frustrating barriers for certain user groups as their speech shifts from the average. People with dementia pause more frequently mid-sentence for example, and people with hearing impairments may mispronounce words learned post-diagnosis. We explore whether semantic parsing can improve accessibility for people with non-standard speech, and consequently become more robust to external disruptions like dogs barking, sirens passing, or doors slamming mid-utterance. We generate corpora of disrupted sentences paired with their underspecified Abstract Meaning Representation (AMR) graphs, and use these to train pipelines to understand and repair disruptions. Our best disruption recovery pipeline lost only 1.6% graph similarity f-score when compared to a model given the full original sentence.

#21 Developing Speech Processing Pipelines for Police Accountability

Authors: Anjalie Field ; Prateek Verma ; Nay San ; Jennifer L. Eberhardt ; Dan Jurafsky

Police body-worn cameras have the potential to improve accountability and transparency in policing. Yet in practice, they result in millions of hours of footage that is never reviewed. We investigate the potential of large pre-trained speech models for facilitating reviews, focusing on ASR and officer speech detection in footage from traffic stops. Our proposed pipeline includes training data alignment and filtering, fine-tuning with resource constraints, and combining officer speech detection with ASR for a fully automated approach. We find that (1) fine-tuning strongly improves ASR performance on officer speech (WER=12-13%), (2) ASR on officer speech is much more accurate than on community member speech (WER=43.55-49.07%), and (3) domain-specific tasks like officer speech detection and diarization remain challenging. Our work offers practical applications for reviewing body camera footage and general guidance for adapting pre-trained speech models to noisy multi-speaker domains.

#22 Prosody-controllable Gender-ambiguous Speech Synthesis: A Tool for Investigating Implicit Bias in Speech Perception

Authors: Éva Székely ; Joakim Gustafson ; Ilaria Torre

This paper proposes a novel method to develop gender-ambiguous TTS, which can be used to investigate hidden gender bias in speech perception. Our aim is to provide a tool for researchers to conduct experiments on language use associated with specific genders. Ambiguous voices can also be beneficial for virtual assistants, helping to reduce stereotypes and increase acceptance. Our approach uses a multi-speaker embedding in a neural TTS engine, combining two corpora recorded by a male and a female speaker to achieve a gender-ambiguous timbre. We also propose speaker-disentangled prosody control to ensure that the timbre is robust across a range of prosodies and to enable more expressive speech. We optimised the output using an SSL-based network trained on hundreds of speakers. We conducted perceptual evaluations on the settings judged most ambiguous by the network, which showed that listeners perceived the speech samples as gender-ambiguous, including in prosody-controlled conditions.
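
A hedged sketch of one way to obtain an ambiguous timbre in a multi-speaker TTS engine, by interpolating between the two corpus speakers' embeddings; the tensors here are random stand-ins, the interpolation range is an assumption, and the paper's speaker-disentangled prosody control is not shown:

```python
import torch

d_spk = 256
male_embedding = torch.randn(d_spk)     # stand-in for the male-corpus speaker embedding
female_embedding = torch.randn(d_spk)   # stand-in for the female-corpus speaker embedding

def mixed_speaker(alpha: float) -> torch.Tensor:
    """alpha = 0 -> male corpus voice, alpha = 1 -> female corpus voice."""
    return (1 - alpha) * male_embedding + alpha * female_embedding

# Sweep the interpolation range; in the paper, an SSL-based network trained on
# hundreds of speakers scores candidates and the most ambiguous settings are kept.
candidates = [mixed_speaker(a) for a in torch.linspace(0.3, 0.7, 5).tolist()]
```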

#23 Affective attributes of French caregivers' professional speech

Authors: Jean-Luc Rouas ; Yaru Wu ; Takaaki Shochi

In this paper, we detail our approach to studying the vocal characteristics of caregivers in retirement homes. To achieve this goal, we conducted recordings of 20 professional caregivers across two retirement homes. Using headset microphones connected to smartphones, we were able to capture the caregivers' speech while allowing them complete freedom of movement without compromising sound quality. The recordings consisted of three tasks: reading text, informal interviews, and professional role-play scenarios with a fictitious patient. We processed the recordings using an automatic speech recognition system, which provided word or phone sequences and their corresponding timestamps. Our analysis focused on identifying differences in emotional tone, lexical content, speech rate, fundamental frequency, and intensity between spontaneous speech conditions. Ultimately, our aim is to develop automated training tools that capture the unique vocal characteristics of professional caregivers.

#24 An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients

Authors: Daniel Escobar-Grisales ; Tomás Arias-Vergara ; Cristian David Ríos-Urrego ; Elmar Nöth ; Adolfo M. García ; Juan Rafael Orozco-Arroyave

Early detection and monitoring of Parkinson's disease are crucial for properly treating and managing the symptoms. Automatic speech and language analysis has emerged as a promising non-invasive method to monitor the patient's state. This study analyzed different speech and language representations for automatic classification between Parkinson's disease patients and healthy controls. First, each modality is analyzed independently: general representations such as Wav2vec or BETO are used together with representations designed to model disease traits, such as phonemic identifiability in the speech modality and grammatical-unit analysis in the language modality. The best speech and language representations are then combined using a fusion strategy based on Gated Multimodal Units. The best results are achieved with the multimodal approach, outperforming all results obtained with the unimodal representations and the traditional fusion strategy.
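
A minimal sketch of a Gated Multimodal Unit fusing an utterance-level speech vector and a language vector (dimensions are illustrative; the gating follows the standard GMU formulation rather than the authors' exact implementation):

```python
import torch
import torch.nn as nn

class GatedMultimodalUnit(nn.Module):
    """Gated Multimodal Unit: a learned gate decides, per dimension, how much
    of the speech vs. language representation to keep in the fused vector."""
    def __init__(self, d_speech: int, d_text: int, d_out: int):
        super().__init__()
        self.proj_speech = nn.Linear(d_speech, d_out)
        self.proj_text = nn.Linear(d_text, d_out)
        self.gate = nn.Linear(d_speech + d_text, d_out)

    def forward(self, speech_vec, text_vec):
        h_s = torch.tanh(self.proj_speech(speech_vec))
        h_t = torch.tanh(self.proj_text(text_vec))
        z = torch.sigmoid(self.gate(torch.cat([speech_vec, text_vec], dim=-1)))
        return z * h_s + (1 - z) * h_t

gmu = GatedMultimodalUnit(d_speech=768, d_text=768, d_out=256)
fused = gmu(torch.randn(4, 768), torch.randn(4, 768))   # e.g., Wav2vec + BETO utterance vectors
```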

#25 Personalization for Robust Voice Pathology Detection in Sound Waves

Authors: Khanh-Tung Tran ; Truong Hoang ; Duy Khuong Nguyen ; Hoang D. Nguyen ; Xuan-Son Vu

Automatic voice pathology detection is promising for non-invasive screening and early intervention using sound signals. Nevertheless, existing methods are susceptible to covariate shifts due to background noise, human voice variation, and data selection biases, leading to severe performance degradation in real-world scenarios. Hence, we propose a non-invasive framework that contrastively learns personalization from sound waves as a pre-training step and predicts latent-space profile features through semi-supervised learning. It allows all subjects from various distributions (e.g., regionality, gender, age) to benefit from personalized predictions for robust voice pathology detection in a privacy-preserving manner. We extensively evaluate the framework on four real-world respiratory illness datasets, including Coswara, COUGHVID, ICBHI, and our private dataset, ASound, under multiple covariate shift settings (i.e., cross-dataset), improving overall performance by up to 4.12%.
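
A hedged sketch of a contrastive personalization objective of the kind described: embeddings of two recordings from the same subject are pulled together while other subjects in the batch act as negatives (this is a generic InfoNCE-style loss with illustrative sizes, not the framework's exact loss):

```python
import torch
import torch.nn.functional as F

def personalization_contrastive_loss(z1, z2, temperature: float = 0.1):
    """Simplified InfoNCE-style loss: z1[i] and z2[i] are embeddings of two
    recordings (or augmented views) from the same subject; other rows in the
    batch act as negatives, so the encoder learns subject-specific profiles."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))             # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = personalization_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```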